Exploring Entropy with Cellular Automata

Alex Deich adeich2@illinois.edu

Entropy is a distinctly muddy subject for me, and I've always been a little confused and curious about how it works for different physical systems. I know that in physics, entropy always increases, but is this true for all systems? How easy is it to make a system whose entropy decreases? How does physics entropy compare with entropy in information theory?

Recently, I was reading about cellular automata1, and I wondered how the entropy of these things changed over time. After all, certain cellular automata (like Conway's Game of Life) are often touted as toy models which eerily recreate a lot of the complexity found in nature. Does the entropy of these models reflect this qualitative property? This turned out to be a poorly-posed question and the process of making it a better question taught me a lot about cellular automata and entropy.

Entropy

I dislike the saying "entropy is a measure of disorder" (why would Nature's idea of disorder align with mine?), and so I read Wikipedia for a while. It turns out that while all definitions of entropy are ultimately equivalent, each notion has a natural interpretation.

Shannon Entropy

The first definition I looked at was Shannon entropy, which was what Claude Shannon devised for studying information theory. He was looking at how strings of text were arriving over a radio, and wanted a way to quantify how much "new information" each successive new character provided to the string.

For example, if you had received 6 0's in a row, 000000, there's not very much information. If you were to describe this string, you would say something like "6 0's". This counts as a form of lossless compression: you've taken a message of 6 bits and given a shorter prescription for how to reproduce it exactly2.

Now, if the next digit is a 0, it adds very little information: the compression becomes "7 0's". However, adding a 1, 0000001 represents a relatively large amount of new information. No longer is the compression stating the number of 0's. Instead, it becomes "6 0's, 1 1", which is about twice as big as the original compression. Indeed, if we had a string of 0's and 1's with no discernible pattern, like 1110010, the compression is about the same size as the string itself: "3 1's, 2 0's, 1 1, 1 0".
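To make the compression idea concrete, here's a toy run-length encoder in Python. This is just a sketch of the verbal scheme above, not anything Shannon actually used (and its output pluralizes every run, e.g. "1 1's"):

```python
from itertools import groupby

def run_length_encode(bits):
    """Describe a bit string by its runs, e.g. '0000001' -> "6 0's, 1 1's"."""
    return ", ".join(f"{len(list(run))} {symbol}'s"
                     for symbol, run in groupby(bits))

print(run_length_encode("000000"))   # 6 0's
print(run_length_encode("0000001"))  # 6 0's, 1 1's
print(run_length_encode("1110010"))  # 3 1's, 2 0's, 1 1's, 1 0's
```

For the patternless string, the description is about as long as the string itself, just as in the examples above.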

So how can we quantify the amount of new information that each additional bit provides?

One way to do it would be to take the fraction of the message which is a 0 and the fraction which is a 1. Call them $p_0$ and $p_1$, respectively. We want a quantity which takes these two values and returns something which tracks with our notions of information contained in a string. One way to do this would be to simply say that the information contained in a string is $I = 1 - \max{\{p_0, p_1\}}.$

This would appear to work okay: for a string of all 0's or all 1's, it returns 0, which is a quality we're looking for. However, this and many other simple attempts to define information fail in an important regard. When we consider the information contributed by two new bits, we should expect the total information to be the sum of the two individual contributions: $I_\text{total} = I(p_1) + I(p_2)$. However, this is tricky! Considering the two events at once, we see that there are four possible ways to add two bits to the string: 00, 01, 10, and 11, and the joint event occurs with probability $p_1 \times p_2$. In other words, the information of the two events taken together needs to equal the sum of the information of the two events individually: $I(p_1 p_2) = I(p_1) + I(p_2)$. The simplest function with this additive property is the logarithm.

So, our new suggestion is that the information communicated by a new bit is $I = \sum_i \log(p_i)$. This isn't quite right, though. In the simplest case of two equally probable outcomes (0.5 chance that the next bit is a 0 or a 1), this will return a negative number, which feels wrong: sure, we could define entropy to be a negative quantity, but the discussion above really wants us to say that a high-entropy message is a big positive number. For this reason, we take the logarithm of the reciprocal of the probability, which is the same as putting a minus sign in front: $\log(p_i^{-1}) = -\log(p_i)$.

Now, if we sample $N$ new bits, each symbol $i$ will occur about $n_i = N p_i$ times, for individual probability $p_i$ (in the case of a binary string, $i \in \{0,1\}$). So, finally, the average amount of new information conveyed per new bit of message is $$ S_\text{Sh.}= -\sum_i p_i \log{p_i}. $$ This is the Shannon entropy.

Because Shannon was thinking about bits, where there are only two possible symbols, he chose $\log_2$ so that messages with equal numbers of 0's and 1's (and therefore maximum entropy) would give a value of 1, but the choice of base is really just a scaling factor.

In Python, an implementation of this looks like

In [1]:
import numpy as np

def shannon_entropy(message):
    tabsize = len(message)
    S = 0

    # count the 1's and 0's in the message
    n1 = np.count_nonzero(message)
    n0 = tabsize - n1

    P0 = n0 / tabsize
    P1 = n1 / tabsize

    # by convention, 0 * log(0) contributes nothing
    if P0 > 0:
        S += -P0 * np.log2(P0)
    if P1 > 0:
        S += -P1 * np.log2(P1)
    return(S)

# Low-information strings have low entropy
s1 = [1,1,1,1]
print(f"S({s1}) = {shannon_entropy(s1)}")

# Max-information strings have max entropy
s2 = [0,1,0,1,0,1]
print(f"S({s2}) = {shannon_entropy(s2)}")
S([1, 1, 1, 1]) = 0.0
S([0, 1, 0, 1, 0, 1]) = 1.0

Shannon Entropy in Cellular Automata

To explore these ideas in cellular automata (hereafter CA), I've written a Python package I call CALab, whose source you can view. CALab has code for evolving 2D grids of 1's and 0's with periodic boundaries according to a CA rule which depends only on the current state of a grid to determine the next. There are three rules built-in, and the process for adding your own is straightforward.

What is a cellular automaton?

Cellular automata are toy models in which grids of cells can take one of two values, "alive" and "dead". The grids change in time, with their state at a given timestep determined by their state at the previous. The exact rules which are used to determine who's alive and who's dead can be very complicated, but it is often noted how surprisingly simple rules can lead to very complicated structures and phenomena. Conway's Game of Life is one example.

CALab also includes some animation code to animate a CA, along with whatever variable you want to track during its evolution. For example, here's a "glider", a classic persistent pattern in Conway's Game of Life:

In [2]:
from calab import *
import analysis
# calab contains a directory with a bunch of initial conditions
# stored as serialized pickle files
glider = get_ic("ics/life/glider.pkl")
updater = update_rules.game_of_life
animate_ca(glider, updater)
Out[2]:

Game of Life

Conway's Game of Life (hereafter GL) works by applying the following rules to each cell in the grid:

  1. If the cell is alive and has fewer than two living neighbors, or more than three, the cell dies
  2. If the cell is dead and has exactly three living neighbors, the cell is given life. All other cells remain untouched.
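CALab ships its own `game_of_life` updater; purely as an illustration of the two rules above, a minimal numpy sketch on a periodic grid might look like this (my own version, not CALab's implementation):

```python
import numpy as np

def game_of_life_step(grid):
    # count the 8 neighbors of every cell, wrapping at the edges
    neighbors = sum(np.roll(np.roll(grid, dx, axis=0), dy, axis=1)
                    for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                    if (dx, dy) != (0, 0))
    # rule 2: a dead cell with exactly three living neighbors is born
    born = (grid == 0) & (neighbors == 3)
    # rule 1, restated: a living cell with two or three living
    # neighbors survives; anything else dies
    survives = (grid == 1) & ((neighbors == 2) | (neighbors == 3))
    return (born | survives).astype(int)
```

Running this on a "blinker" (three living cells in a row) flips it between horizontal and vertical, one of the simplest oscillators.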

These simple rules famously give rise to surprisingly complex structures and patterns. How does the Shannon entropy evolve for various GL configurations? Let's look at the entropy for the glider. Intuitively, this should remain constant: there are always 5 living cells (represented as 1's), so the probability that any given cell is alive or dead is the same at every timestep. Let's check:

In [3]:
animate_ca(glider,
           updater,
           tracking_variables=[analysis.shannon_entropy],
           labels=["Shannon entropy"])
Out[3]:

Indeed it is. What about for a more complicated pattern? Say, a random grid. How does GL's Shannon entropy evolve for that?

In [4]:
animate_ca(random_table((15,15)),
           update_rules.game_of_life,
           nframes=150,
           tracking_variables=[analysis.shannon_entropy],
           labels=["Shannon entropy"],
           verbose=True)
rendering timestep 149
Out[4]:

So already we've found something kind of interesting: Random initial conditions tend to lose entropy in GL. This is the first indication that GL doesn't necessarily model physical systems very well. Physical systems tend to increase in entropy over time, while almost all GL configurations lose information over time3 (see the appendix for discussion). This is an important point. That Conway's Game of Life loses information over time implies two things:

  1. Loss of time-reversibility. If the information at timestep $t_{n+1}$ is less than the information at $t_n$, then you cannot recreate $t_n$ given only $t_{n+1}$. It is currently thought that the laws of physics are time-reversible.

  2. Entropy will tend to decrease. If entropy is a measure of new information, then entropy in GL will decrease over time, as we have seen. Contrast this with a system in which information is conserved: as such a system changes, the information about its previous states must go somewhere in the system. That information is stored in positions and momenta, so the system has to occupy more and more states to hold it, and its entropy increases.

Finally, let's look at the following initial condition, where all of the living cells are grouped together in the middle. This is kind of like a confined gas: In the real world, these living cells would like to diffuse through the rest of the box. This is a classic example of increasing entropy.

In [5]:
circle = get_ic("ics/circle.pkl")
animate_ca(circle,
           updater,
           nframes=40,
           tracking_variables=[analysis.shannon_entropy],
           labels=["Shannon entropy"])
Out[5]:

A rapid loss of entropy! Very bizarre.

So, let's instead look at what happens when we define a new CA rule, one which keeps the amount of information constant (it's said to "conserve information"): The random walk.

The random walk

The random walk CA iterates over each cell in the grid. At every timestep, it executes these rules:

  • If the cell is alive, choose a random direction

    • If the cell in the chosen direction is dead, move the current living cell to that spot, and kill the current cell.
    • If the cell in the chosen direction is alive, keep choosing random directions until a dead one is found.

This is a reasonable model of Brownian motion, which describes how the molecules of a gas move.
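For illustration, here is a sketch of such a rule as I read it from the steps above (not CALab's actual `random_walk` implementation; I've assumed four possible directions, and that a cell surrounded entirely by living neighbors stays put):

```python
import random
import numpy as np

def random_walk_step(grid):
    new = grid.copy()
    rows, cols = grid.shape
    directions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    # visit every cell that was alive at the start of the timestep
    for i, j in zip(*np.nonzero(grid)):
        random.shuffle(directions)
        for di, dj in directions:
            ni, nj = (i + di) % rows, (j + dj) % cols
            if new[ni, nj] == 0:
                # move the living cell into the dead neighbor
                new[ni, nj], new[i, j] = 1, 0
                break
        # if every neighbor is alive, the cell simply stays put
    return new
```

Note that the number of living cells is conserved by construction, which is exactly why the Shannon entropy stays flat in the experiment that follows.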

Let's look at how the Shannon entropy of a random walk CA evolves for an initial condition where all of the particles start out clumped together:

In [6]:
circle = get_ic("ics/circle.pkl")
updater = update_rules.random_walk
animate_ca(circle,
           updater,
           nframes=20,
           tracking_variables=[analysis.shannon_entropy],
           labels=["Shannon entropy"],
           verbose = True)
rendering timestep 19
Out[6]:

The Shannon entropy is constant! We should have expected this: after all, as in the glider example, the number of particles (and so the amount of information) is staying constant. But then why does a diffusing system in real life grow monotonically in entropy? To see why, we need to talk about entropy as it is understood in statistical mechanics.

Entropy in Statistical Mechanics

At first glance, the new definition of entropy is identical to before:

$$ S_\text{SM} = - \sum_i p_i \log p_i. $$

However, what the $p_i$ represent and how we calculate them is now quite different. Rather than asking what the probability is that the next bit in a message takes a given value, we're asking what the probability is of a given "system" being in a certain "state".

What are systems and states?

A system is a collection of things that are subject to the laws of physics. Two billiard balls constitute a system, as do a bunch of planets orbiting a star, or a single atom. All that matters is that you define what exactly is in the system, so we can be precise about it. A state is a configuration of the system. Where are the planets in their orbits? How fast are they going? Are the billiard balls colliding? What are their momenta?

In physics, different states often require different amounts of energy to reach. For example, consider a system of a single basketball near the surface of the Earth. In one state, the ball is one meter above the ground. In another, the ball is 20 meters above the ground. The second state requires much more energy than the first.

Statistical mechanics says that the probability $p_i$ of a system being found in a given state is related to the energy $E_i$ required to get it in that state4 by

$$ p_i \sim e^{-E_i/k_B T}, $$

where $T$ is the temperature and $k_B$ is Boltzmann's constant.

So if you have a huge collection of gas around a planet, you're much more likely to find the gas near the surface of the planet than far away. This explains why atmospheric density drops off the way it does. Now, notice that the index $i$ is no longer running over just the two possibilities from before (0 or 1), but over a potentially gigantic number of states. The number of possible altitudes a molecule can have above the Earth's surface is uncountably5 infinite!
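As a toy numerical illustration (my own, not part of CALab), the entropy of a Boltzmann distribution over a handful of energy levels behaves as you'd expect: equally likely states maximize it, and one dominant low-energy state drives it toward zero.

```python
import numpy as np

def boltzmann_entropy(energies, kT=1.0):
    # Boltzmann weights p_i ~ exp(-E_i / kT), normalized to sum to 1
    w = np.exp(-np.asarray(energies, dtype=float) / kT)
    p = w / w.sum()
    return -(p * np.log2(p)).sum()

print(boltzmann_entropy([0, 0, 0, 0]))  # four equal states: log2(4) = 2 bits
print(boltzmann_entropy([0, 100]))      # one state dominates: nearly 0 bits
```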

Likewise, if your system is a bunch of gas in a box, then the energy density is lowest if the gas is evenly spread out through the box, rather than concentrated in a corner. Thus, the gas wants to diffuse.

Nearest-neighbor likelihood

So we want to come up with a way to assign probabilities to cells in any CA such that the entropy will evolve like we expect for physical systems. In particular, we want a diffusing system (like modeled above with the random walk CA) to increase in entropy, rather than stay constant, as Shannon entropy dictates.

To do this, I defined something I call "nearest neighbor" (NN) likelihood, which assigns a probability to every cell that it will become alive or dead in the next timestep. This is simply accomplished by looking at how many of that cell's neighbors are alive: If many of them are, then it has a high chance of being alive in the next generation. If none are, then it has no chance. This is creating a sort of "potential energy" for each cell, which is a function of the number of living neighbor cells.

Of course, this is a naïve method: GL, after all, will kill any cell with 8 living neighbors, while the random walk won't. However, this is not an important detail; one can change the specific probabilities however one pleases. The important thing is to assign those probabilities based on each cell's immediate neighbors, rather than on a global calculation, as Shannon entropy does.

Additionally, because I want to compare NN entropy to Shannon, I scale the NN entropy by its maximum possible value, which is the value yielded for a checkerboard. For this reason, Shannon entropy will always be an upper bound on NN entropy.
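The actual implementation lives in CALab's `analysis` module, but a minimal sketch of the idea might look like the following. The assumptions here are mine: each cell's "likelihood" is simply its count of living neighbors out of 8 (periodic boundaries), and each cell contributes the binary entropy of that likelihood. With $\log_2$, the checkerboard comes out to exactly 1 bit per cell, so the normalization is built in.

```python
import numpy as np

def nn_entropy(grid):
    # count each cell's living neighbors, with periodic boundaries
    n = sum(np.roll(np.roll(grid, dx, axis=0), dy, axis=1)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0))
    p = n / 8.0  # "likelihood" that this cell is alive next step
    # each cell contributes the binary entropy of its likelihood
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    h = np.nan_to_num(h)  # convention: 0 * log(0) = 0
    # a checkerboard gives p = 1/2 everywhere, so its mean entropy
    # is exactly 1 bit: the normalizing maximum
    return h.mean()
```

A constant grid gives 0 (every likelihood is 0 or 1), and the alternating checkerboard gives 1, matching the sanity checks below.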

Let's see how this new definition of entropy compares to the Shannon entropy in a few examples.

In [7]:
# first, a sanity check.  Does a constant grid give 0 entropy?
import matplotlib.pyplot as plt
a = np.ones((8,8))
plt.imshow(a, origin='lower', cmap='gist_yarg')
print(f"nearest neighbor entropy: \n{analysis.nn_entropy(a)}")
nearest neighbor entropy: 
0.0
In [8]:
# Now, how about a grid with a large concentration of 1's?
a = np.zeros((8,8))
a[:,:4] = 1
plt.imshow(a, origin='lower', cmap='gist_yarg')
print(f"nearest neighbor entropy: \n{analysis.nn_entropy(a)}\n")

print(f"shannon entropy: \n{analysis.shannon_entropy(a)}")
nearest neighbor entropy: 
0.4772170014624827

shannon entropy: 
1.0
In [9]:
# Now, a maximum entropy situation: alternating 1's and 0's
a = np.zeros((8,8))
a[1::2,::2] = 1
a[::2,1::2] = 1
plt.imshow(a, origin='lower', cmap='gist_yarg')
print(f"nearest neighbor entropy: \n{analysis.nn_entropy(a)}\n")

print(f"shannon entropy: \n{analysis.shannon_entropy(a)}")
nearest neighbor entropy: 
1.0

shannon entropy: 
1.0
In [10]:
# Now, let's look at how it does the random walk CA from before:
circle = get_ic("ics/circle.pkl")
updater = update_rules.random_walk
animate_ca(circle,
           updater,
           nframes=500,
           tracking_variables=[analysis.shannon_entropy,
                               analysis.nn_entropy],
           labels=["Shannon entropy",
                    "Nearest neighbor entropy"],
           verbose = True)
rendering timestep 499
Out[10]:

And now we have a version of entropy which behaves as we expect it to, which I find very satisfying.

Other Information-conserving systems

So we defined a very simple system which conserves particle number, but it's not truly reversible. Every cell requires a random direction be chosen in order to propagate, so it would be impossible to figure out where a cell came from given only its current state.

There is a huge literature on reversible CA's, as they seem to be useful toy models for various questions in theoretical computer science. The one I chose, Critters, is a version of a "block cellular automaton", the details of which aren't important here. I chose it because of its cute name.

Critters is interesting in that it is superficially similar to GL. Many initial conditions will become some kind of oscillating configuration with a short period. The main difference is that when two configurations collide in Critters, there will be a large explosion, rather than the generally destructive interaction you see in GL. It is very fun to watch.

How do Critters deal with the "confined gas" initial condition from earlier?

In [11]:
updater = update_rules.critters
animate_ca(circle,
           updater,
           nframes=150,
           tracking_variables=[analysis.shannon_entropy,
                               analysis.nn_entropy],
           labels=["Shannon entropy",
                    "Nearest neighbor entropy"],
           verbose = True)
rendering timestep 149
Out[11]:

The entropy increases! This is great: it's a confirmation that systems which conserve information also increase in entropy over time.

Conclusion

This was a cool way to relate the two definitions of entropy. It also underscored the way that information in both senses – the 0's and 1's of Shannon, and the systems and states of Boltzmann – is fundamentally the same thing. These cellular automata are great toy models for demonstrating this quite fundamental property of stuff in nature.

One point worth making is that Shannon entropy is not useless; rather, it makes an assumption about the relative likelihood of any given bit which is ill-suited to studying CA in the way I was interested in. In particular, the Shannon entropy is the same as the Boltzmann entropy for the microcanonical ensemble, which takes all microstates to have the same probability.
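To see that correspondence concretely: for a microcanonical ensemble of $N$ equally likely microstates, $p_i = 1/N$ and the formula collapses to $\log_2 N$, the Shannon entropy of a maximally patternless message over $N$ symbols.

```python
import numpy as np

N = 16
p = np.full(N, 1 / N)          # microcanonical: all microstates equal
S = -(p * np.log2(p)).sum()
print(S)                       # log2(16) = 4 bits
```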

Finally, and I'm putting this in the conclusion because I wanted to include it but there didn't seem to be a more natural place for it, this topic is adjacent to what is discussed in one of my favorite blog posts by Scott Aaronson.

Footnotes

[1] Sadly, this is because of the recent death of John Conway, who among many, many other things, was famous for his cellular automaton Game of Life.

[2] It is a subtle point, but entropy is not actually the same thing as compressibility. Compressibility is a measure of a string's complexity, which is a fraught and interesting topic in its own right. See this paper for a discussion.

[3] This is not necessarily true of all configurations for GL. GL is known to be Turing complete, so it is possible in theory to simulate any system, including one in which information is conserved. It is merely true that all primitive configurations (i.e. the 512 ways for a cell and its neighbors to be arranged) either stay constant or lose information. See the appendix for further details.

[4] It is technically not a state but a "microstate," a large number of which constitute a "macrostate" which has well-defined quantities like temperature or pressure. See Wikipedia for more.

[5] In a classical context. In a quantum context, this number might actually simply be very large, but countably infinite.

Appendix: Game of Life configurations tend to die out

To get a quick idea of how GL configurations survive, we can look at how the "primitive" configurations evolve. By primitive configurations, I just mean every configuration that can occupy a 3x3 grid of cells, because this is the basic unit that GL uses to determine the next state of the center cell.

Most primitive configurations of GL either reach some "ground state" where they repeat over some period, or die out completely. To see this, let's just run through every possible grid of 3x3 cells, and evolve them to see what happens:

In [12]:
# get every possible string of length 9
def all_binary_strings(arr, idx, lst):
    if idx == len(arr):
        lst.append(np.copy(arr))
        return(arr)
    
    arr[idx] = 0
    all_binary_strings(arr, idx + 1, lst)
    
    arr[idx] = 1
    all_binary_strings(arr, idx + 1, lst)

# evolve until they repeat or are dead
def evolve_to_death(init_table, table_num):
    table = np.copy(init_table)
    history = []
    steps = 0
    complete = False
    while not complete:
        print(f"table {table_num}, step {steps}", end="\r")
        new_table = time_evolve(update_rules.game_of_life, table, 1)
        table = new_table
        steps += 1
        if str(table) not in history:
            history.append(str(table))
        else:
            complete = True
            
        # this is just to check that there are no extremely long-lived
        # configurations.  It will turn out that all configurations
        # stop after fewer than 3 steps
        if steps > 100:
            print(init_table)
            break
    newN = np.count_nonzero(table)
    return(newN - np.count_nonzero(init_table), steps)

ninecells = []
all_binary_strings(np.zeros(9), 0, ninecells)
nine_cell_squares = [np.reshape(i, (3,3)) for i in ninecells]
lost, gained, same = [0, 0, 0]
total_steps = []
for i in range(len(nine_cell_squares)):
    
    # put the 3x3 block in a grid big enough
    # that it won't eventually interact with itself
    arr = np.zeros((50, 50))
    arr[25:28,25:28] = nine_cell_squares[i]
    new_cells, steps = evolve_to_death(arr, i)
    total_steps.append(steps)
    if new_cells == 0:
        same += 1
    elif new_cells > 0:
        gained += 1
    elif new_cells < 0:
        lost += 1
print(f"""\nnumber of configurations which lost cells: {lost}
number of configurations which gained cells: {gained}
number of configurations which kept the same number of cells: {same}
maximum number of steps before stopping: {max(total_steps)}""")
table 511, step 1
number of configurations which lost cells: 277
number of configurations which gained cells: 137
number of configurations which kept the same number of cells: 98
maximum number of steps before stopping: 2

To reiterate, this does not mean that all GL configurations will therefore lose entropy. The Turing-completeness of GL guarantees that it can simulate any computable system, so there must be configurations which simulate a system whose entropy grows. This merely provides a casual sketch of an explanation for why most configurations tend towards simple, non-growing states.